We are looking to establish whether there is a correlation between the order in which a horse breezes and its breeze-up performance. Furthermore, we want to examine whether breeze-up performance is indicative of future performance.
Information on the following variables was collected for horses sold at breeze-up sales between 2017 and 2020:
Analysis will be carried out in R relying heavily on the following packages:
Other packages are also used; these are all found in libraries.R.
I have also conducted a small amount of further research in the postPresentation.Rmd file. This looks into a few of the ideas alluded to in Part 6.
We want to find out whether there is any correlation between a horse's position in the breeze order and the time in which it runs the breeze. Firstly, I will generate a summary table. I have grouped the horses into buckets according to whether they breeze in the first, second or third segment of the order, and generated statistics for each bucket. The statistics use Time2f.Vmed as the time measure, which is defined in the glossary. I have also included a histogram showing that the horses are evenly distributed across the breeze order.
| BucketNo. | Min | Max | Mean | Median | IQR |
|---|---|---|---|---|---|
| 1 | -1.425 | 1.994 | -0.039 | -0.118 | 0.730 |
| 2 | -1.393 | 1.994 | 0.005 | -0.086 | 0.759 |
| 3 | -1.462 | 1.992 | 0.109 | 0.044 | 0.840 |
This shows that we have an equal number of horses running in each portion of the breeze, which is what we would hope for from our data.
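The bucket summary above could be produced along the following lines. This is a minimal sketch on simulated data, not the code behind the report: the data frame `breeze`, the helper `summarise_buckets()` and the assumption that `pctSeqBreeze` runs from 0 to 100 are illustrative. Only the column names `Time2f.Vmed` and `pctSeqBreeze` are taken from the report.

```r
# Simulated stand-in for the real data (values are illustrative only).
set.seed(1)
breeze <- data.frame(
  pctSeqBreeze = runif(300, min = 0, max = 100),  # assumed 0-100 scale
  Time2f.Vmed  = rnorm(300, mean = 0, sd = 0.6)
)

# Assign each horse to the first, second or third segment of the breeze order.
breeze$bucket <- cut(
  breeze$pctSeqBreeze,
  breaks = c(0, 100 / 3, 200 / 3, 100),
  labels = c("1", "2", "3"),
  include.lowest = TRUE
)

# Hypothetical helper: one row of summary statistics per bucket.
summarise_buckets <- function(df, value_col) {
  per_bucket <- lapply(split(df[[value_col]], df$bucket), function(x) {
    data.frame(Min = min(x), Max = max(x), Mean = mean(x),
               Median = median(x), IQR = IQR(x))
  })
  out <- do.call(rbind, per_bucket)
  out <- cbind(BucketNo = names(per_bucket), out)
  rownames(out) <- NULL
  out
}

bucket_summary <- summarise_buckets(breeze, "Time2f.Vmed")
print(bucket_summary)

# Counts per bucket, the numbers a histogram of breeze order would display.
print(table(breeze$bucket))
```

Cutting on the percentage position rather than the raw position keeps the buckets comparable across sales with different numbers of horses.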
From the mean and median statistics it appears that horses that breeze earlier in the order record lower (faster) breeze times than those that run later. To test whether this difference is significant I will conduct t-tests on the time differences between the buckets. Since we have a hypothesis about the direction of the effect, we will conduct one-tailed t-tests. I test the difference between bucket 1 and bucket 2, between bucket 1 and bucket 3, and finally between bucket 2 and bucket 3.

| Buckets Tested | t Value | Degrees Of Freedom | p Value |
|---|---|---|---|
| Bucket1 and Bucket2 | -1.740 | 2174.809 | 0.041 |
| Bucket1 and Bucket3 | -5.619 | 2168.079 | 0.000 |
| Bucket2 and Bucket3 | -3.938 | 2166.363 | 0.000 |
We can see from the p-values that there is a significant difference in breeze times between each pair of buckets: in every case the probability of this occurring due to chance is below our threshold of p = 0.05, although only marginally so for bucket 1 versus bucket 2 (p = 0.041). This probability is smallest for the difference between bucket 1 and bucket 3, as expected.
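A one-tailed comparison of two buckets can be sketched as below. The data are simulated and the vector names `bucket1` and `bucket3` are hypothetical. The use of Welch's unequal-variance test is an assumption, suggested by the fractional degrees of freedom in the table rather than stated in the report.

```r
# Simulated breeze times for two buckets (illustrative values only).
set.seed(2)
bucket1 <- rnorm(200, mean = -0.04, sd = 0.6)
bucket3 <- rnorm(200, mean = 0.11, sd = 0.6)

# alternative = "less": the alternative hypothesis is that bucket 1 has the
# smaller mean time. var.equal = FALSE (the default) gives Welch's test,
# which produces non-integer degrees of freedom.
res <- t.test(bucket1, bucket3, alternative = "less", var.equal = FALSE)

print(c(t = unname(res$statistic),
        df = unname(res$parameter),
        p = res$p.value))
```

With `alternative = "less"` the p-value is the lower-tail probability of the t statistic, so a negative t value in the expected direction gives a small p-value.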
I have also included a linear regression, which allows a more visual interpretation of the data. Here I have not divided the data into buckets. I still use the 2 furlong time relative to the median 2 furlong time, in seconds.
| Term | Estimate | Standard Error | Regression Statistic | p Value |
|---|---|---|---|---|
| Intercept | -0.0511852 | 0.0194528 | -2.631 | 0.00854713 |
| Explanatory Variable | 0.0009976 | 0.0002124 | 4.697 | 0.00000275 |
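A regression of this kind could be fitted as sketched below. The report does not name the explanatory variable in this table, so the use of `pctSeqBreeze` on a 0 to 100 scale is an assumption, and the data are simulated. Under that assumption, a slope of 0.0009976 would correspond to roughly 0.1 s between the first and last horse in the order (0.0009976 × 100).

```r
# Simulated data with a small positive slope (illustrative values only).
set.seed(3)
n <- 500
breeze <- data.frame(pctSeqBreeze = runif(n, 0, 100))  # assumed 0-100 scale
breeze$Time2f.Vmed <- -0.05 + 0.001 * breeze$pctSeqBreeze + rnorm(n, sd = 0.6)

# Simple linear regression of breeze time on position in the order.
fit_time <- lm(Time2f.Vmed ~ pctSeqBreeze, data = breeze)

# Columns: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- summary(fit_time)$coefficients
print(round(coef_table, 4))
```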
We have established that there is a correlation between breeze order and breeze time. There are a number of things that could explain this.
This study will discuss the ability of breeze order to predict the future ability of a horse, which leads us on to the next question.
Having shown that there is a significant correlation between the order in which a horse breezes and its breeze time, I will now look at whether there is a correlation between the order in which a horse breezes and its future max rating. As in the last question, I start with a summary table so that we can get a general idea of any correlation and the direction of the effect. We refer to the max future rating as ORFmax, which is defined in the glossary. The following table aims to show whether breeze order has any correlation with ORFmax.
| BucketNo. | Min | Max | Mean | Median | IQR |
|---|---|---|---|---|---|
| 1 | 4 | 121 | 73.299 | 73.0 | 25.00 |
| 2 | 4 | 114 | 69.884 | 71.0 | 25.00 |
| 3 | 5 | 119 | 69.412 | 69.5 | 22.25 |
| Term | Estimate | Standard Error | Regression Statistic | p Value |
|---|---|---|---|---|
| Intercept | 73.87072 | 0.8485 | 87.058 | 0.00000000 |
| Explanatory Variable | -0.05978 | 0.0150 | -3.987 | 0.00006951 |
From the above results we can see a general trend: horses running later in the breeze have a lower future max rating. We can also see that this correlation is not as pronounced as that between breeze time and future max rating.
I will now conduct a t-test to determine whether this trend is significant.
| Buckets Tested | t Value | Degrees Of Freedom | p Value |
|---|---|---|---|
| Bucket1 and Bucket2 | 3.216 | 1290.152 | 0.001 |
| Bucket1 and Bucket3 | 3.710 | 1276.594 | 0.000 |
| Bucket2 and Bucket3 | 0.436 | 1222.827 | 0.331 |
Upon t-testing for significance we see that bucket 1 differs significantly from both bucket 2 and bucket 3, but that there is no significant difference between buckets 2 and 3. This suggests that breeze order may have some value to us and is worth investigating further. However, there are a few adjustments that we can make in order to generate a more informative measure. This is what we will do in the next question, which involves generating an order score for the horses and using this rather than simply the percentage position.
Each consignor determines the order in which their own horses breeze. It is interesting to look at whether there is any correlation between the size of the consignor and their average breezing position: if we adjust for this, we might get a more accurate measure of how the horses are ordered relative to their predetermined future ability.
Here I have defined a small consignor as one that has had fewer than 25 horses at sale since we began collecting order data. To account for the different number of horses that run at each breeze, I describe the breeze position as a percentage of the highest position at that sale.
| Small Seller | Average Breeze Position |
|---|---|
| FALSE | 47.250 |
| TRUE | 57.813 |
Comparing the two groups, there is a clear difference: small sellers breeze later on average (57.8% of the way through the order) than larger sellers (47.3%).
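The percentage position and the small-seller flag described above can be sketched on toy data as follows. The data frame `sales` and the columns `sale`, `consignor`, `seqBreeze` and `smallSeller` are hypothetical names, and the threshold is set to 3 instead of 25 purely so that the toy data contain a small seller.

```r
# Toy data: two sales, two consignors (illustrative only).
sales <- data.frame(
  sale      = c("A", "A", "A", "A", "B", "B"),
  consignor = c("X", "X", "Y", "X", "X", "Y"),
  seqBreeze = c(1, 2, 3, 4, 1, 2),
  stringsAsFactors = FALSE
)

# Position as a percentage of the highest position at the same sale.
sales$pctSeqBreeze <- 100 * sales$seqBreeze /
  ave(sales$seqBreeze, sales$sale, FUN = max)

# Flag consignors with fewer horses than the threshold
# (25 in the report, 3 here for the toy data).
threshold <- 3
n_horses <- ave(sales$seqBreeze, sales$consignor, FUN = length)
sales$smallSeller <- n_horses < threshold

# Average percentage position for small and large sellers.
avg_position <- aggregate(pctSeqBreeze ~ smallSeller, data = sales, FUN = mean)
print(avg_position)
```

The two group means could then be compared with `t.test()` in the same way as the bucket comparisons.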
| Data Name | t Value | Degrees of Freedom | p Value |
|---|---|---|---|
| smallSellerSeq and largeSellerSeq | 9.353 | 1579.019 | 0 |
The t-test shows that this result is significant.
It would also be interesting to see if there are any particular consignors that stand out as having a more significant influence over position.
| Seller | Average Breeze Position | Total No. Sold |
|---|---|---|
| Aguiar Bloodstock | 45.147 | 34 |
| Ardglas Stables | 38.652 | 46 |
| Ballinahulla Stables | 37.970 | 33 |
| Ballycullen Stables | 52.133 | 30 |
| Bansha House Stables | 60.378 | 119 |
| Bloodstock Connection | 45.091 | 33 |
| Brown Island Stables | 37.359 | 78 |
| Bushy Park Stables | 60.481 | 27 |
| CAJ Stables | 35.385 | 39 |
| Derryconnor Stud | 55.040 | 50 |
| Egmont Stud | 49.855 | 76 |
| Gaybrook Lodge | 31.602 | 83 |
| Grove Stud | 39.078 | 90 |
| Horse Park Stud | 44.507 | 69 |
| Hyde Park Stud | 56.339 | 112 |
| Katie Walsh | 40.538 | 39 |
| Kilminfoyle House Stud | 55.062 | 32 |
| Knockanglass Stables | 58.452 | 177 |
| Knockgraffon Stables | 61.588 | 34 |
| Lackendarra Stables | 63.172 | 29 |
| Longway Stables | 49.138 | 87 |
| Lynn Lodge Stud | 46.417 | 36 |
| Malcolm Bastard | 44.024 | 41 |
| Mayfield Stables | 42.433 | 90 |
| Meadow View Stables | 36.603 | 68 |
| Mocklershill | 39.883 | 274 |
| Oak Farm Stables | 34.806 | 36 |
| Oak Tree Farm | 41.930 | 43 |
| Powerstown Stud | 48.176 | 68 |
| Shanaville Stables | 74.400 | 30 |
| Sherbourne Lodge | 69.280 | 82 |
| Star Bloodstock | 29.903 | 62 |
| TallyHo Stud | 46.650 | 180 |
| Yeomanstown Stud | 48.553 | 47 |
| Small Sellers | 57.813 | 898 |
This table is too big at the moment. It is not obvious to a reader what it is showing, but it does contain some interesting results: it highlights that there are clearly a few consignors whose horses breeze earlier than others.
This plot shows the breeze score with a normal distribution centred on a score of 0 overlaid. The normal distribution is an acceptable fit.
I have given a position score to each horse based on the average position for that consignor. I have put all small consignors (those who have sold fewer than 25 horses) together when calculating this adjustment.
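One way such a score could be computed is sketched below. The report does not give the formula, so defining the score as a horse's percentage position minus the average position of its consignor group is an assumption, as are the data frame `horses` and the columns `nSold` and `group`.

```r
# Toy data (illustrative only). nSold is a hypothetical per-consignor total.
horses <- data.frame(
  consignor    = c("X", "X", "X", "Y", "Y", "Z"),
  pctSeqBreeze = c(20, 40, 60, 70, 90, 50),
  nSold        = c(30, 30, 30, 40, 40, 5),
  stringsAsFactors = FALSE
)

# Pool all small consignors (fewer than 25 horses) into a single group.
horses$group <- ifelse(horses$nSold < 25, "Small Sellers", horses$consignor)

# ASSUMPTION: score = horse's position minus the mean position of its group.
# Negative scores mean earlier than usual for that consignor.
horses$orderScore <- horses$pctSeqBreeze -
  ave(horses$pctSeqBreeze, horses$group, FUN = mean)

print(horses[, c("consignor", "group", "pctSeqBreeze", "orderScore")])
```

A score defined this way averages to zero within each group, which would be consistent with a distribution centred on 0.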
| Term | Estimate | Standard Error | Regression Statistic | p Value |
|---|---|---|---|---|
| Intercept | 70.91491 | 0.43538 | 162.882 | 0.00000 |
| Explanatory Variable | -0.03939 | 0.01603 | -2.458 | 0.01405 |
The higher p-value here (0.014, compared with 0.00007 for the regression on pctSeqBreeze) suggests that orderScore is actually not as good a predictor of ORFmax as pctSeqBreeze. Therefore, for the rest of our analysis we will use pctSeqBreeze where necessary.
The ultimate question that this study is asking is: given the breeze time of a horse, does knowing its breeze order give us any additional information for predicting the future value of the horse?
It is therefore important to try to separate order and breeze time. We have shown that both have value in predicting the future value of a horse; we now need to test for multicollinearity to see whether they have value as individual variables.
From the correlation matrix and the plot we see again that pctSeqBreeze is a better measure than orderScore for determining the future ability of a horse.
From the correlation matrix in part 4 it appears that the correlation between time and pctSeqBreeze is low enough that multicollinearity won't be too much of an issue.
## [1] "VIF = 1.10597496056056"
We usually only consider multicollinearity to be a problem when VIF > 10. Therefore, we assume that multicollinearity is not a problem and can move on to the next section.
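For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors (not from the model for ORFmax). The sketch below computes it by hand for two simulated predictors; the data are illustrative and this is not the code that produced the value reported above.

```r
# Simulated predictors with a weak relationship (illustrative values only).
set.seed(4)
n <- 400
pctSeqBreeze <- runif(n, 0, 100)
Time2f.Vmed  <- 0.001 * pctSeqBreeze + rnorm(n, sd = 0.6)

# R^2 from regressing one predictor on the other.
r2 <- summary(lm(Time2f.Vmed ~ pctSeqBreeze))$r.squared

# With only two predictors this R^2 is the squared correlation between them,
# so both predictors share the same VIF.
vif <- 1 / (1 - r2)
print(paste("VIF =", vif))
```

A VIF of 1 means the predictors are uncorrelated; values grow without bound as the predictors become closer to collinear.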
As we have seen that pctSeqBreeze is a better measure than orderScore, we will use it for the rest of the analysis. Here we aim to determine whether we actually gain anything through the use of pctSeqBreeze: does having this extra variable add any value to our model? There is no point in adding a variable to our model if it doesn’t improve it. Is a model involving both time and pctSeqBreeze better than one that only involves time?
In our various models:
| Term | Estimate | Standard Error | Regression Statistic | p Value |
|---|---|---|---|---|
| Intercept | 73.87072 | 0.8485 | 87.058 | 0.00000000 |
| Explanatory Variable | -0.05978 | 0.0150 | -3.987 | 0.00006951 |
Our p-value of 0.00007 suggests that this negative trend is significant and that pctSeqBreeze has value with respect to predicting ORFmax.
| Term | Estimate | Standard Error | Regression Statistic | p Value |
|---|---|---|---|---|
| Intercept | 70.102 | 0.4197 | 167.05 | 0.000e+00 |
| Explanatory Variable | -9.932 | 0.7134 | -13.92 | 4.841e-42 |
Our p-value of 4.841e-42 suggests that this negative trend is significant and that Time2f.Vmed has value with respect to predicting ORFmax.
We have seen that both factors, when used on their own, have significant value in predicting the future ability of a horse. However, we want to know whether using them together provides a better model than using time on its own.
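One common way to answer this question is a partial F-test between the nested models, sketched below on simulated data. The object names `d`, `fit_time` and `fit_both` are hypothetical, and this is offered as an illustration rather than the comparison used in the report.

```r
# Simulated data in which both predictors carry some signal (illustrative).
set.seed(5)
n <- 600
d <- data.frame(
  pctSeqBreeze = runif(n, 0, 100),
  Time2f.Vmed  = rnorm(n, sd = 0.6)
)
d$ORFmax <- 72 - 0.04 * d$pctSeqBreeze - 10 * d$Time2f.Vmed + rnorm(n, sd = 18)

# Nested models: time only, then time plus breeze order.
fit_time <- lm(ORFmax ~ Time2f.Vmed, data = d)
fit_both <- lm(ORFmax ~ Time2f.Vmed + pctSeqBreeze, data = d)

# Partial F-test: does adding pctSeqBreeze reduce the residual sum of squares
# by more than chance would explain?
cmp <- anova(fit_time, fit_both)
print(cmp)

# Adjusted R^2 penalises the extra parameter, so it is comparable across models.
print(c(time_only = summary(fit_time)$adj.r.squared,
        both      = summary(fit_both)$adj.r.squared))
```

When a single variable is added, the partial F statistic equals the square of that variable's t statistic in the larger model, so the two tests agree.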
We can see from our residuals vs fitted plot that there is no evidence of the fan shape characteristic of heteroscedasticity. A fan shape would mean that the variance of the residuals increases as the fitted values increase. This does not appear to be the case.
The next plot is the QQ-plot. Though most of the points seem to fall on the line which indicates that our residuals come from a normal distribution, there are some points that stray from the line in the lower and upper quantiles of the plot. It is possible that these points do not come from a normal distribution, but most of our points seem to come from a normal distribution so there is not a lot to worry about here.
The third plot created is the scale-location plot. This plot is similar to the residual plot, but uses the square root of the standardized residuals instead of the residuals themselves. This makes trends in residuals more evident.
Finally, we see the leverage plot. This plot graphs the standardized residuals against their leverage. It also includes the Cook’s distance boundaries. Any point outside those boundaries would be an influential point, one whose removal would noticeably change the fitted model. Since we cannot even see the boundaries on our plot, we can conclude that we have no influential outliers.
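The four diagnostic plots discussed above correspond to the default panels of base R's `plot()` method for `lm` objects. The sketch below draws them for a model fitted to simulated data; the object names and the temporary output file are illustrative.

```r
# Simulated data and model (illustrative values only).
set.seed(6)
n <- 300
d <- data.frame(
  pctSeqBreeze = runif(n, 0, 100),
  Time2f.Vmed  = rnorm(n, sd = 0.6)
)
d$ORFmax <- 72 - 0.04 * d$pctSeqBreeze - 10 * d$Time2f.Vmed + rnorm(n, sd = 18)
fit_both <- lm(ORFmax ~ Time2f.Vmed + pctSeqBreeze, data = d)

# which = 1: residuals vs fitted, 2: normal Q-Q,
# 3: scale-location, 5: residuals vs leverage with Cook's distance contours.
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
par(mfrow = c(2, 2))
plot(fit_both, which = c(1, 2, 3, 5))
invisible(dev.off())

# Numeric check behind the leverage plot: by default the contours are drawn
# at Cook's distances of 0.5 and 1, so a maximum well below 0.5 means no
# point reaches them.
cd <- cooks.distance(fit_both)
print(max(cd))
```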
| | Est. | S.E. | t val. | p |
|---|---|---|---|---|
| (Intercept) | 72.083 | 0.821 | 87.799 | 0.000 |
| pctSeqBreeze | -0.040 | 0.014 | -2.806 | 0.005 |
| Time2f.Vmed | -9.733 | 0.716 | -13.601 | 0.000 |
| | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | 70.473 | 73.693 |
| pctSeqBreeze | -0.069 | -0.012 |
| Time2f.Vmed | -11.137 | -8.330 |
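The interval bounds can be checked by hand as estimate ± t × standard error. The sketch below does this from the rounded values printed in the coefficient table, taking the residual degrees of freedom (1910) from the F statistic reported for this model. Because the inputs are rounded, the results agree with the table only to within rounding.

```r
# Estimates and standard errors as printed in the coefficient table.
coefs <- data.frame(
  term = c("(Intercept)", "pctSeqBreeze", "Time2f.Vmed"),
  est  = c(72.083, -0.040, -9.733),
  se   = c(0.821, 0.014, 0.716)
)

# Residual degrees of freedom, from F(2, 1910).
df_resid <- 1910

# Critical value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

coefs$lower <- coefs$est - t_crit * coefs$se
coefs$upper <- coefs$est + t_crit * coefs$se
print(coefs)
```

The interval for pctSeqBreeze excludes zero, which matches its p-value of 0.005 in the coefficient table.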
| F(2, 1910) | R-sqd | Adj-R-sqd |
|---|---|---|
| 101.2061 | 0.0958 | 0.0949 |
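The three fit statistics are linked by standard formulas, so they can be cross-checked from the rounded R² alone. The sketch below assumes two predictors and 1910 residual degrees of freedom, as in F(2, 1910), and recovers the adjusted R² and F statistic to within rounding.

```r
# Values taken from the table above.
r2       <- 0.0958   # R-squared (rounded)
p        <- 2        # number of predictors
df_resid <- 1910     # residual degrees of freedom

# Sample size implied by the degrees of freedom: n = df_resid + p + 1.
n <- df_resid + p + 1

# Adjusted R^2 penalises R^2 for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(c(n = n, adj_r2 = round(adj_r2, 4), f_stat = round(f_stat, 2)))
```

An R² of about 0.096 means the two variables together explain under 10% of the variance in ORFmax, even though both coefficients are statistically significant.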
Refer to modellingORFmax.Rmd for some additional information on the best models. This suggests that the best model is the one that includes both pctSeqBreeze and Time2f.Vmed. However, even this model does not do as good a job of predicting ORFmax as price does. There are clearly other factors that people can observe and are paying for.
The research into the correlation between breeze order and breeze time must be interpreted with caution. There is clearly a lot of noise when looking into the correlation between the two variables. The correlation that we are seeing may reach significance only because of our large number of data points. However, I don’t believe this first section to have no value, so have left it in.
In the future it will be interesting to see how our findings relate to the prices that people end up paying for particular horses. It could be the case that people are placing too much emphasis on the Breeze Time of a horse and actually paying more for a horse than it is worth, purely due to this. We will need to somehow model the future value of a horse using ORFmax. I attempt to do this in postPresentation.Rmd.
It will also be useful to look at a similar question about breeze order at the consignor level. An individual consignor is allotted spots in the order and then chooses which of their horses to place in each spot. Therefore, a consignor has complete control over the order of their own horses. If a particular consignor always places their horses in the order that they think represents their future ability, this could be significant to us in predicting the future value of a horse. However, it is likely that different consignors have different tactics and that we only have enough data to elucidate the strategies of the largest consignors. Smaller consignors may not have sold enough horses for us to infer this.
We are also provided with a good basis for spotting edge cases. If a large consignor has placed a horse early in their order but it then runs a slow breeze time, we might be inclined to underestimate the value of the horse if we do not take into account the fact that the consignor has chosen to place the horse early in their order. Does the consignor know something about the future ability of this particular horse that the breeze time does not? We may therefore be able to get good value on a horse that runs a slow breeze time by using the intuition of the consignor to uncover variables unknown to us.